Cross-biome microbial networks reveal functional redundancy and suggest genome reduction through functional complementarity

The structure of microbial communities arises from a multitude of factors, including the interactions of microorganisms with each other and with the environment. In this work, we sought to disentangle those drivers by performing a cross-study, cross-biome meta-analysis of microbial occurrence data in more than 5000 samples, applying a novel network clustering algorithm aimed to capture conditional taxa co-occurrences. We then examined the phylogenetic and functional composition of the resulting clusters, and searched for global patterns of assembly both at the community level and in the presence/absence of individual metabolic pathways. Our analysis highlighted the prevalence of functional redundancy in microbial communities, particularly between taxa that co-occur in more than one environment, pointing to a relationship between functional redundancy and environmental adaptation. In spite of this, certain pathways were observed in fewer taxa than expected by chance, suggesting the presence of auxotrophy, and presumably cooperation among community members. This hypothetical cooperation may play a role in genome reduction, since we observed a negative relationship between the size of bacterial genomes and the size of the community they belong to. Overall, our results suggest the microbial community assembly is driven by universal principles that operate consistently across different biomes and taxonomic groups.


Aggregation scores
In Pascual-García et al. (2014), we defined the bare aggregation score between two taxa i and j as minus the logarithm of the probability P(n ij , M ) that they co-occur at n ij locations out of M under the null model.We performed an iterative computation, letting the number of locations vary from m=0 to M .The initial condition is P(n ij =0 ,0)=1, P(n ij =1 , 0)=0, and we update the probabilities as P(n ij , m)=P(n ij , m−1) ( 1−π im π jm ) + P(n ij −1 , m−1) ( π im π jm ) .
However, the probability P(n ij , M ) is small for rare taxa whose taxon-specific parameter p i is small, which tends to overestimate their aggregation scores.In order to reduce false positives, although at the possible expense to increase false negatives, here we compute the conditional probability to observe the companion taxon j given that the rare conditioning taxon i is present.We relabel i and j at each location such that p i ( A)< p j ( A ), i.e. the conditioning taxon i embodies the condition that is most difficult to fulfil, with the effect to increase the conditional probability.We compute the conditional probability of n ij co-occurrences conditioned to the observed distribution of taxon i at all locations, denoted as { X ia } , and we define the aggregation score between i and j as minus the probability that n ij is equal or larger than the observed value, conditioned to the observed values of { X ia } : The probability of taxon i at the denominator of the conditional probability is given by a lik (X ia ) , i.e. it is the product of the likelihoods lik (X ia ) of the observed presence and absences of taxon i in each sample a.We compute the numerator of the conditional probability iteratively as For computational convenience, we compute one minus the probability that n ij is smaller than the observed value, which can be computed faster.Likewise, we define the segregation score as minus the logarithm of the conditional probability that n ij is equal to or smaller than the observed value.

Decoupling aggregation scores from cosmopolitanism
The mean aggregation score of a taxon is correlated with the number of samples in which the taxon is present, which we denote as its cosmopolitanism.To reduce this correlation, we transform the bare aggregation scores into Z-scores as follows.With the null model of the observed matrix we extract 100 random matrices, we compute their null model and, through it, we compute the bare scores S ij for all pairs using the random matrix as it were the observed one.Finally, we obtain mean and standard deviation of these simulated scores S ij and we use them to transform the observed score into a Z-score, Z ij =(S ij obs −S ij ) /σ (S ij ), where S ij obs are the aggregation score obtained from the real data, and S ij , σ (S ij ) are the mean and standard deviations of aggregation scores in the simulated data with a similar minimum cosmopolitanism as that of the real pair, obtained from a cubic spline fit.We transform the aggregation scores of the simulated data to Z-scores in a similar fashion, yielding a distribution of null Z-scores.

Thresholds
We use the distribution of null Z-scores obtained in the previous step to determine a Z-score threshold such that only 0.01% of the null Z-scores are higher than it (False Positive Rate = 0.0001).We then subtract this cut-off from the Z-scores obtained from the real observations in order to generate corrected Z-scores.Any pair of taxa with a positive corrected Z-score is deemed to significantly aggregate in our samples.

Supplementary Note 2: Comparison of genus-level core predictions using genomes from isolates and Metagenome-Assembled Genomes (MAGs) for different environments
This work relies heavily on a functional inference strategy in which the metabolic pathways associated to a given microbial genus are predicted from the functional annotations of all of the sequenced genomes from that genus.Similar functional inference strategies have been successfully applied before, with the most prominent example being the PiCRUST tool for the prediction of metagenome functions from marker genes (e.g.16S rRNA gene) profiles (Douglas et al., 2020), with 2942 citations at the time of writing this note (Google Scholar, accessed on 2024-04-05).However, there is a concern that this may be a source of biases, since a vast majority of the high quality reference genomes used as the basis for functional inference come from isolates, which may fail to capture the true environmental and functional diversity of the taxa whose functions are being inferred (Albright & Louca, 2023).An alternative to this would be to reconstruct Metagenome-Assembled Genomes (MAGs) from shotgun metagenomic sequences, in an attempt to capture the true functions associated to each taxa in the samples of interest without resorting to functional inference.MAGs are however often incomplete and can be biased due to sequencing depth and false positive effects (Browne et al., 2009;Meziti et al. 2021;Nunes de Rocha et al., 2023).
To alleviate potential biases due to the use of genomes from isolates for functional inference, we restricted our analysis only to core genomes, that is, the metabolic pathways that were predicted in all of the available genomes from each genus.This restricts our analysis only to pathways that are conserved at the genus level, but is more robust as those are expected to be more conserved, regardless of whether the genomes used for inference are MAGs or isolates.To test these expectations, we sought to evaluate whether genus-level core predictions were similar when calculated using isolate genomes or MAGs.We chose to focus in the genus Pseudomonas, since 1) it has been extensively characterized, with thousands of genomes from hundreds of different species being available in public databases (Hesse et al., 2018).
2) it is present in a wide range of environments (Spiers et al., 2000;Silby et al., 2011), and thus a good target to evaluate whether functional inference is affected by environmental biases.
3) it is metabolically versatile, with individual strains harboring mosaic genomes with customized genomic repertoires (Mathee et al., 2008;Silby et al., 2011).This complexity makes Pseudomonas a worst-case scenario for genus-level functional inference based on incomplete information, as inter-genome variability may result in different results depending on the number of genomes used for core genome estimation.
We downloaded 14464 high quality annotated Pseudomonas genomes from the proGenomes3 database (Fullam et al., 2023).Those genomes had been annotated with KEGG Orthology terms using a common pipeline, and grouped into species-level clusters, which were then classified according to their occurrence into different habitats.Amongst these genomes, 14221 belonged to isolates, and 243 to high quality MAGs (CheckM: completeness > 90% and contamination < 5%; Parks et al., 2016;Fullam et al., 2023).We then filtered out those genomes annotated with less than 1500 independent KO terms, which removed 79 genomes from isolates.
The resulting distribution of isolate genomes and MAGs across the different habitats considered in proGenomes3 is presented in Supplementary Table SN2.1).Genome size (measured as the number of unique KOs in the genome) was consistently smaller on MAGs than on isolate genomes (Supplementary Table SN2.1;Supplementary Figure SN2.1), with the size of MAGs being on average 83% of that of isolate genomes.This is noticeably smaller than the 90% MAG completeness threshold used in proGenomes3; it is unclear whether this obeys to a real biological cause (e.g.larger genomes being on average easier to isolate) or to errors in bioinformatics pipelines resulting in "high quality" MAGs that nonetheless are missing a larger fraction of genomic information than expected, a known issue further discussed in Meziti et al. (2021).